
Work in progress: upgrade to astropy ecosystem structure #3

Open · wants to merge 46 commits into base: master

Conversation

@bmorris3 (Member) commented Nov 8, 2019

Hi @pmneila and @cefisher,

I'm a new postdoc working with Kevin Heng, nice to e-meet you @pmneila 👋. One of my projects is to bring ESP services up to compliance with astropy developer standards, and HELA is the first toolkit I'm starting to work with. I hope this PR will be the beginning of a conversation about tweaks we can make to the package. Ideally we can collaborate over GitHub, but I'm happy to meet in person if that would be helpful.

This PR is a modest overhaul of the formatting within HELA, and a reorganization of the package structure. There are a few broad goals of the PR:

  • make HELA's structure closer to the traditional Python package structure
  • make a Pythonic API that exposes HELA's data products to users in an interactive/notebook environment, rather than the legacy command-line interface (e.g., so you can make your own plots with HELA's outputs)
  • add integrations with Travis CI and Read the Docs for continuous integration testing and self-building Sphinx documentation, respectively

I've moved a copy of the present implementation and README of HELA into a directory called legacy, so the old functionality is preserved.

New API

The API now allows for a workflow that looks like this:

# Generate an example dataset directory
from hela import generate_example_data
example_dir, training_dataset, samples_path = generate_example_data()

# Import RandomForest object from HELA
from hela import RandomForest
import matplotlib.pyplot as plt

# Initialize a random forest object:
rf = RandomForest(training_dataset, example_dir, samples_path)

# Train the random forest:
r2scores = rf.train(num_trees=1000, num_jobs=1)
plt.show()

# Predict posterior distributions from random forest
posterior = rf.predict(plot_posterior=True)
posterior_slopes, posterior_intercepts = posterior.samples.T
plt.show()

The train and predict methods on the RandomForest object work just like the rfretrieval.py train and rfretrieval.py predict functions worked in the legacy version.

Documentation

There's now a docs directory containing self-building Sphinx documentation, which you can build locally on your copy of this branch with

python setup.py build_docs
open docs/_build/html/index.html

The narrative documentation starts with an example that @cefisher and I worked through today, where we demonstrate fitting a line with HELA (by predicting the slope and intercept parameters using a random forest).
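For intuition, the idea behind that example can be sketched without HELA at all: a random forest regressor learns the mapping from the sampled y-values of a noisy line back to its (slope, intercept) labels. A minimal standalone sketch using plain scikit-learn (illustrative only, not HELA's internals):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Each training row is a line sampled on a fixed x grid, labeled by the
# (slope, intercept) pair that generated it
rng = np.random.default_rng(42)
n_lines, n_x = 2000, 100
x = np.linspace(0, 1, n_x)
slopes = rng.uniform(size=n_lines)
intercepts = rng.uniform(size=n_lines)
X = slopes[:, None] * x + intercepts[:, None]
X += rng.normal(scale=0.01, size=X.shape)  # small observational noise
y = np.column_stack([slopes, intercepts])

rf = RandomForestRegressor(n_estimators=100).fit(X, y)

# "Retrieve" the parameters of an unseen line
query = (0.7 * x + 0.5)[None, :]
print(rf.predict(query))  # approximately [[0.7, 0.5]]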

GitHub integrations/webhooks

After this PR is merged, we can update the GitHub webhooks/integrations to get Travis CI and Read the Docs up and running.

@bmorris3 (Member, Author) commented Nov 8, 2019

I've just added a first (very generously unrigorous) end-to-end test which you can run with

python setup.py test

It runs the linear regression example used in the docs and checks that:

  • the R^2 values are near unity for the slope and intercept
  • the slope and intercept are close to the true (input) values
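A rough sketch of what that test can look like (the thresholds and keyword arguments here are illustrative, not necessarily the exact ones in this PR):

import numpy as np

from hela import RandomForest, generate_example_data


def test_end_to_end():
    example_dir, training_dataset, samples_path = generate_example_data()
    rf = RandomForest(training_dataset, example_dir, samples_path)

    # R^2 should be near unity for both the slope and the intercept
    r2scores = rf.train(num_trees=1000, num_jobs=1)
    assert np.all(np.array(r2scores) > 0.9)

    # Posterior medians should land near the true (input) values; 0.7 and
    # 0.5 are stand-ins for whatever truths generate_example_data encodes
    posterior = rf.predict(plot_posterior=False)
    slope_med, intercept_med = np.median(posterior.samples, axis=0)
    assert abs(slope_med - 0.7) < 0.1
    assert abs(intercept_med - 0.5) < 0.1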

@bmorris3 (Member, Author) commented Nov 8, 2019

After a struggle, the package now passes its first continuous integration tests! See the passing Travis builds here, which include a format/style check (via flake8) and an end-to-end functional test of the entire module (via pytest). 🎉

@pmneila (Collaborator) left a comment

Hi @bmorris3,

Pleased to e-meet you as well :). I went quickly through your changes. It looks great.

I left some minor comments, especially regarding the names of a few functions/classes that we can discuss.

(Review comments on hela/forest.py, hela/models.py, and hela/dataset.py — outdated/resolved.)
@pmneila (Collaborator) left a comment

Hi @bmorris3

I left a few more comments. My biggest concerns are about the necessity of the class Model. I think we could get rid of it and just provide some auxiliary functions, since the main functionality is already in the RandomForestWrapper, including saving/loading. We can discuss it in the next meeting on Wednesday, if you are around.

I can take care of the implementation details if you want. But maybe it's better to wait until the user interface is clear.

(Review comments on hela/model.py — outdated/resolved.)

hela/model.py (outdated):
return ranges.T


def generate_example_data():
@pmneila (Collaborator):

This function is generating data with no noise and with independent slopes and ints, which makes the posterior computation pointless. I think generate_example_data should build an example dataset that shows the ability to estimate posteriors, which is the main strength of this library. Ideally, a dataset with some correlation between slopes and ints that shows up in the posterior?

@bmorris3 (Member, Author):

I've been trying to craft an example with correlations between slopes and ints but haven't been able to create a good working example. Would you be able to give me some more hints on how to demonstrate this?

@pmneila (Collaborator):

Hi @bmorris3,

Sorry, I have realized that my comment was very vague and unclear. Something like this might work:

import os
import json

import numpy as np


def generate_example_data():
    """
    Generate an example dataset in the new directory ``linear_dataset``.

    Returns
    -------
    training_dataset : str
        Path to the dataset metadata JSON file
    example_dir : str
        Path to the directory of the example data
    samples : `~numpy.ndarray`
        Example data to retrieve with the random forest
    """
    example_dir = 'linear_dataset'
    training_dataset = os.path.join(example_dir, 'example_dataset.json')

    os.makedirs(example_dir, exist_ok=True)

    # Save the dataset metadata to a JSON file
    dataset = {
        "metadata": {
            "names": ["slope", "intercept"],
            "ranges": [[0, 1], [0, 1]],
            "colors": ["#F14532", "#4a98c9"],
            "num_features": 1000
        },
        "training_data": "training.npy",
        "testing_data": "testing.npy"
    }

    with open(training_dataset, 'w') as fp:
        json.dump(dataset, fp)

    # Generate fake training data
    npoints = 20000

    slopes = np.random.uniform(size=npoints)
    ints = np.random.uniform(size=npoints)
    x = np.linspace(0, 1, 1000)[:, np.newaxis]

    # Add correlated noise to parameters to introduce degeneracies
    noise_ints = np.random.normal(scale=0.15, size=npoints)
    noise_slopes = np.abs(noise_ints) + np.random.normal(scale=0.02, size=npoints)

    # Add also noise to data points (not strictly necessary)
    data = ((slopes + noise_slopes) * x + (ints + noise_ints) +
            np.random.normal(scale=0.01, size=(1000, npoints)))

    labels = np.vstack([slopes, ints])
    X = np.vstack([data, labels])

    # Split dataset into training and testing segments
    training = X[:, :int(0.8 * npoints)].T
    testing = X[:, int(0.8 * npoints):].T

    np.save(os.path.join(example_dir, 'training.npy'), training)
    np.save(os.path.join(example_dir, 'testing.npy'), testing)

    # Generate a bunch of samples with a test value to "retrieve" with the
    # random forest:
    true_slope = 0.7
    true_intercept = 0.5

    samples = true_slope * x + true_intercept
    return training_dataset, example_dir, samples.T[0]

The posteriors would look like this:

[image: posterior matrix showing the correlated slope/intercept distributions]

(Review comments on hela/wrapper.py and hela/model.py — outdated/resolved.)

hela/model.py (outdated):
return np.array([forest_i.feature_importances_ for forest_i in forests])


class Model(object):
@pmneila (Collaborator):

I'm not sure about how the user should use this class. Is it that important to have a class like this? As a user of an API, I would expect to have more fine-grained control.

  1. It saves the model during training. While this makes sense on the command line, a user might now want to train and predict in a single session, with no saving involved. I think the user should be responsible for printing/saving when and where they want. Of course, we could give auxiliary functions to help with saving and printing.
  2. It assumes the format of the dataset. The user might want to use a different format.
  3. It loads the saved model from a specific location every time predict is called, which makes no sense if the user does not need to save the model.
  4. It only allows predicting from a single data_file given in the constructor.

The code from the train method, for example, should be what the user might write to train a model and build the predicted_vs_real plots. But I don't see why we should give that as a method of a class.

@bmorris3 (Member, Author) commented Nov 12, 2019

You're right, we should definitely take a closer look at this class.

I think the user should be responsible for printing/saving when and where they want. Of course, we could give auxiliary functions to help with saving and printing.

I don't feel strongly about this either way. I think there may still be use cases where the saving is helpful – for example when the training takes a very long time to run. Whatever you think makes sense 👍

It assumes the format of the dataset. The user might want to use a different format.

I'm not sure I understand – can't the user supply arbitrary x and y (in the correct shapes)?

It only allows predicting from a single data_file given in the constructor.

Good point. Could the solution to all of the above points be to not expose this class to the user, and just use it internally?

@pmneila (Collaborator):

I don't even think we need this class internally. Currently we expect the user to use the library like this:

import numpy as np
import matplotlib.pyplot as plt

from hela import Retrieval

# Here, Retrieval requires that the user uses our JSON format to describe
# the dataset. This is what I meant by "assumes the format of the dataset".
# They need to write a JSON file linking to NPY files that have a specific
# organization (first columns are features, last columns are labels). This is very
# restrictive.
r = Retrieval("dataset/dataset.json", "output_dir")

# Train the random forest
r2scores = r.train(num_trees=1000, num_jobs=1)
fig = r.plot_correlations()
fig.savefig("corr.pdf")
plt.show()

data = np.load("query.npy")
# Currently r.predict always loads the model from disk,
# which might be unnecessary.
posterior = r.predict(data)
# Do something with the posterior...

I find this very limited, but also unclear, as many non-obvious, implicit operations happen under the hood (saving the model and building R2 scores in train, loading the model and printing percentiles in predict, ...).

Instead, without Retrieval, the user could do this:

import numpy as np

import hela

# Load the dataset
# We offer this function for convenience, but this is optional. The user
# might write their own load_dataset, or read the data from NPY files.
dataset = hela.load_dataset("dataset/training_dataset.json")

# Train the model
# Again, this is a helper function. For more fine-grained control, the user
# can create a PosteriorRandomForest and call its fit method.
forest = hela.train_model(dataset, num_trees=1000, num_jobs=1)

# Print some info
print(forest.oob_score_)
...

# Save the model. Optional.
hela.save("output/model.pkl", forest)

# Run validation
dataset = hela.load_dataset("dataset/testing_dataset.json")
pred, r2scores = hela.test_model(forest, dataset)

# Plot
fig = hela.plot_predicted_vs_real(dataset.y, pred, dataset.names, dataset.ranges)
fig.savefig("output/r2.pdf")

# Prediction
data = np.load("query.npy")
# No load required here, but we can load a model with forest = hela.load("output/model.pkl")
posterior = forest.predict_posterior(data)
# or hela.predict_posterior(forest, data) to be consistent with other functions.
# Do something with the posterior...
posterior_ranges = hela.posterior_percentiles(posterior) # or posterior.percentiles()

I think this approach gives complete control to the user, and at the same time it is clearer, as every operation is explicit and only happens if the user asks for it.

If you are fine with the second approach and it suits the astropy developer standards, we can remove the class Retrieval and prepare the collection of helper functions (save, load...).

@bmorris3 (Member, Author):

Sorry for the long delay before I got back to you on this, it's been a busy few weeks!

I see the advantages to the above approach, but I'm still searching for a more object-oriented project structure and workflow. Let me see if I can split the difference between our proposals.

@bmorris3 (Member, Author):

The latest refactor looks like:

import matplotlib.pyplot as plt
from hela import Retrieval, Dataset, generate_example_data

# Create an example dataset
training_dataset, example_dir, example_data = generate_example_data()

# Load the dataset
dataset = Dataset.load_json("linear_dataset/example_dataset.json")

# Train the model
r = Retrieval()
r2scores = r.train(dataset, num_trees=1000, num_jobs=5)

# Print OOB score (optional)
print(r.oob)

# Plot predicted vs real:
fig, ax = r.plot_predicted_vs_real(dataset)

# Save the model (optional)
# from hela import save_model
# save_model("linear_dataset/model.pkl", r.model)

# Predict posterior distribution for slope and intercept of example data
posterior = r.predict(example_data)

# Print posterior ranges (optional)
posterior.print_percentiles(dataset.names)

fig2, ax2 = posterior.plot_posterior_matrix(dataset)

plt.show()

Is that a sufficient compromise between the explicit function calls you're looking for and the object-oriented structure I'm looking for? Again, happy to iterate 👍

(Review comment on hela/wrapper.py — outdated/resolved.)
@bmorris3 (Member, Author):

Hey @pmneila – perhaps we can move forward with this PR, if you have a minute?

@pmneila (Collaborator) commented Apr 15, 2020

Hi @bmorris3

Sure, let's finish this.

BTW, following some requests from @cefisher, I've implemented some additional functionality in HELA. However, this PR was created before those changes. How do you think we should proceed? Do we finish this PR with no extra functionality and then integrate the rest in subsequent PRs? I think that's the best option.

@bmorris3 (Member, Author):

I'm happy to go with your best judgement! Would that be much extra work for you? I don't know how big the merge conflicts are.

@pmneila (Collaborator) left a comment

The merge conflicts are going to be huge. 😅 After going through the code, I still think the best solution is to continue working on this PR and then I'll integrate the new functionality following this new API as much as possible.

There are a few things in the new code that you might find relevant at this point:

  1. It will be necessary to pass the training data at prediction time, so we can compute things like the best-fit model (@cefisher can give you more details on this if you want). One consequence is that we won't need to store names, ranges, and colors in the RandomForestPosterior, since they will be available at test time in the dataset. I think this is actually much better; I found it very weird that the forest should save that information.
  2. I split the classes PosteriorRandomForest (old Model) and Posterior into two different files. I don't think this is a problem, but maybe you want to keep everything in posteriors.py?
  3. I added type annotations. Not sure if that might be a problem with Astropy?

I went through the code and I think we are almost there. I added a few comments below.

@@ -18,7 +18,7 @@
POSTERIOR_MAX_SIZE = 10000


-def plot_predicted_vs_real(y_real, y_pred, names, ranges, alpha='auto'):
+def plot_predicted_vs_real(dataset, retrieval, alpha='auto'):
@pmneila (Collaborator):
retrieval is only used to get the prediction. Pass the prediction instead.

@@ -179,6 +180,46 @@ def __init__(self, samples, weights):
self.samples = samples
self.weights = weights

def print_percentiles(self, names):
@pmneila (Collaborator):

This function prints to the standard output. Code is much simpler if all functions with side effects (such as I/O) are present only in the high-level classes of the code (in this case, in the class Retrieval). Posterior is a low-level class, used internally to represent posteriors. A method like print_percentiles makes sense only in the context of the class Retrieval, but it is very odd for any other user of this class.

I understand that what you wanted to do is allow the final user to do something like:

posterior = retrieval.predict(example_data)
posterior.print_percentiles(dataset.names)

I can think of two possible solutions. The simplest one is changing the interface to something like:

posterior = retrieval.predict(example_data)
retrieval.print_percentiles(dataset.names, posterior)

This is also consistent with the fact that all other printing/plotting functions are in Retrieval.

The other solution, if you really want to keep the original API, would be encapsulating the posterior in a more specific class of posteriors for retrieval (RetrievalPosterior?) that contains all the methods that make sense in the context of the class Retrieval.
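A minimal sketch of that second option (the class name, attributes, and the shape returned by data_ranges are assumptions here):

class RetrievalPosterior:
    """A Posterior enriched with retrieval context (parameter names, etc.)."""

    def __init__(self, posterior, dataset):
        self.posterior = posterior  # low-level Posterior (samples, weights)
        self.dataset = dataset      # carries the names needed for printing

    def print_percentiles(self):
        # I/O lives here, keeping the low-level Posterior free of side effects
        for name, range_i in zip(self.dataset.names,
                                 self.posterior.data_ranges()):
            print("{}: {:.3g} [+{:.3g} -{:.3g}]".format(name, *range_i))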

[values[0], values[2] - values[0], values[0] - values[1]])
return ranges.T

def plot_posterior_matrix(self, dataset):
@pmneila (Collaborator):

As with print_percentiles, this function performs I/O in the class Posterior. The same concerns apply here.

Also, this function leads to a circular dependency in the code, since posteriors.py now depends on plot.py and vice versa.
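One conventional way to break the cycle (a sketch of the general pattern, not code from this PR) is to keep the matrix plot as a free function in plot.py that accepts a Posterior, so posteriors.py never has to import plot.py:

# hela/plot.py -- the dependency now runs one way only: plot.py consumes
# Posterior objects, and posteriors.py knows nothing about plot.py
import matplotlib.pyplot as plt


def plot_posterior_matrix(posterior, dataset):
    n_params = posterior.samples.shape[1]
    fig, axes = plt.subplots(n_params, n_params)
    for i in range(n_params):
        # 1D marginal distributions on the diagonal
        axes[i, i].hist(posterior.samples[:, i],
                        weights=posterior.weights, bins=30)
        axes[i, i].set_xlabel(dataset.names[i])
        # 2D joint distributions below the diagonal
        for j in range(i):
            axes[i, j].scatter(posterior.samples[:, j],
                               posterior.samples[:, i], s=1, alpha=0.2)
    return fig, axes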

"[+{:.3g} -{:.3g}]".format(name_i, *pred_range_i))

def data_ranges(self, percentiles=(50, 16, 84)):
"""
@pmneila (Collaborator):

This method is fine, but it seems a bit strange that it lives in the class Posterior while the function resample_posterior doesn't. Why not put them both as separate functions, or both as methods of this class? I have a slight preference for independent functions over classes, but if you prefer methods that's fine with me.
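For concreteness, the free-function version of both might look like this (a sketch; the weighted-percentile handling is simplified relative to whatever HELA actually does):

import numpy as np

from hela.posteriors import Posterior  # module path assumed


def data_ranges(posterior, percentiles=(50, 16, 84)):
    # Median and +/- offsets per parameter (ignoring weights for brevity)
    values = np.percentile(posterior.samples, percentiles, axis=0)
    return np.array([values[0],
                     values[2] - values[0],
                     values[0] - values[1]]).T


def resample_posterior(posterior, n_samples):
    # Draw an unweighted sample set from the weighted posterior
    p = posterior.weights / posterior.weights.sum()
    idx = np.random.choice(len(posterior.samples), size=n_samples, p=p)
    return Posterior(posterior.samples[idx], np.ones(n_samples))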


# saving model information...
self.oob = self.model.rf.oob_score_

-pred, r2scores = test_model(self.model, self.dataset)
+pred, r2scores = test_model(self.model, dataset)
self.pred = pred
return r2scores
@pmneila (Collaborator):

Why don't we also return pred? Not saying that we must do it, but if we return r2scores, it seems natural.
